Abstract

In this paper we propose new techniques to sample arbitrary third-order tensors, with the objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only a small fraction of the possible $n^3$ elements while still achieving each of three goals:

(a) tensor sparsification: for a tensor that has to be formed from arbitrary samples, compute very few elements to get a good spectral approximation; and, for arbitrary orthogonal tensors,

(b) tensor completion: recover an exactly low-rank tensor from a small number of samples via alternating least squares, or

(c) tensor factorization: approximate the factors of a low-rank tensor corrupted by noise.

Our sampling can be used along with existing tensor-based algorithms to speed them up, removing the computational bottleneck in these methods.


1 Introduction 

Tensors capture higher order relations in the data. Computing factors of higher order tensors has been of interest in fields such as neuroscience (Acar et al., 2007; Kolda & Bader, 2009) and, recently, of increasing interest in machine learning (Signoretto et al., 2014; Yilmaz et al., 2011; Liu et al., 2013), with applications in learning latent variable models like hidden Markov models (HMMs), Gaussian mixture models and latent Dirichlet allocation (LDA) (Anandkumar et al., 2014a), signal processing (Comon, 2009), etc.

In most of these applications, the primary tensor feature of interest is its low-rank factorization or approximation. Existing algorithms to compute it, like alternating least squares (Carroll & Chang, 1970; Harshman, 1970), the tensor power method (De Lathauwer et al., 2000; Anandkumar et al., 2014a), etc., need to access the data in every iteration and can be computationally intensive. This is true both for settings where the tensor is already explicitly formed and available, and settings where it needs to be formed by taking appropriate outer products of data samples.

Methods involving tensors are intrinsically more computationally intensive as compared to, for 
example, those involving matrices; the focus of this paper is to provide a new and (at least a-priori) 
non-obvious technique to sample and compute sparse approximation of tensors. 


Our contributions: Our objective is to determine a small (random) subset of elements of a tensor that can be taken as a sparse surrogate for the tensor, in the sense that their spectral properties are similar. Our main contribution is a new way to determine this small subset in a data-dependent way, so that we can achieve this objective without placing any incoherence-like assumptions on the underlying tensor. We focus on three related but distinct settings for our three main contributions:


• Direct tensor sparsification from samples: This focuses on the common setting (especially in ML applications) where one is given samples $X_i \in \mathbb{R}^n$ and is interested in spectral properties of the outer product tensor $T := \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$. We impose no additional structural assumptions on the tensor. Naively, this requires computing the elements of the tensor first, and random sampling can be quite bad.

We instead provide a new (biased random) sampling distribution which allows us to choose only $m$ elements to compute (the required $m$ is given in Theorem 3.1), yielding a sparse tensor $\hat{T}$ whose spectral error $\|\hat{T} - T\|$ is bounded w.h.p. Furthermore our algorithm can compute the distribution with just one pass over all the samples (and thus requires two passes overall, the second one to actually compute the elements); the computational complexity is thus $O(\mathrm{nnz}(X) + p \cdot m \cdot \log(n))$. Sparse tensors are much easier to store and factorize.


• Exact Tensor Completion: In this setting, one wants to exactly recover a rank-$r$ orthogonal tensor^1 (i.e., $T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^*$, where the $U_i^* \in \mathbb{R}^n$ are orthogonal) from a small number of (randomly chosen) elements; this represents the tensor generalization of the popular matrix completion setting. So far it is known that tensors satisfying restrictive incoherence conditions can be recovered from a small number of uniformly randomly chosen elements (Jain & Oh, 2014). Incoherence, however, is a restrictive assumption that precludes settings with high dynamic ranges (e.g. power laws) in element magnitudes.

We consider the case where the low-rank orthogonal tensor has no additional incoherence properties. We show that, if the samples come from a special distribution (that is "adapted" to the underlying tensor), then the tensor can be provably exactly recovered (a) from a number of samples $m$ that scales as $\big(\sum_{i=1}^{n} \|(U^*)^i\|^{3/2}\big)^2 n$, up to factors polynomial in $r$ and $\log(n)$, w.h.p., and (b) using a simple, fast and parallel weighted alternating least squares algorithm. The distribution depends only on the row norms of $U^*$. The algorithm has a low complexity of $O(m r^2)$, but performance depends on the restricted condition number $\kappa$.


• 2-pass Approximate Tensor Factorization: Finally, we consider the problem, again common in ML applications, where we are given an arbitrary (large) low-rank orthogonal tensor that has been corrupted by arbitrary but bounded noise ($T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^* + \mathcal{E}$), and we would like to approximately recover the orthogonal factors $U^*$ fast.

We provide a new algorithm to do so, that operates in two stages: (a) in the first stage, it takes one pass over all the elements of the tensor to determine a sampling distribution, and in a second pass extracts the sampled elements of the tensor. Then (b) in the second stage, it uses these samples to do a tensor completion via weighted alternating least squares (which is fast, simple and parallel) to compute approximate factors whose error $\|U_i - U_i^*\|$ is bounded, with high probability, in terms of $\|\mathcal{E}\|$ and $\|\mathcal{E}\|_F$ (Theorem 5.1).

^1 Tensors that are not orthogonal have significantly harder algebraic structure.
As mentioned, our algorithms need only two passes over the data, are faster, have lower memory requirements and are trivially parallelizable. Towards the end we present some numerical simulations to illustrate our results. Note that we only discuss results for order-3 symmetric tensors for ease of notation. All our results extend to higher order non-symmetric tensors.

The rest of the paper is organized as follows: In Section 2 we first present some background on tensor factorization and then discuss related work. In Section 3 we present our results on direct tensor sparsification from samples. In Section 4 we present our exact tensor completion results. In Section 5 we discuss the 2-pass algorithm for computing factors of a tensor. Finally we present some results from numerical experiments in Section 6.


2 Background 

In this section we will present some background on tensor factorization and discuss related results. 

Tensor Factorization: An order-3 tensor $T \in \mathbb{R}^{n \times n \times n}$ is of rank $r$ if the minimum number of rank-1 tensors into which it can be decomposed is $r$, i.e., $T = \sum_{i=1}^{r} u_i \otimes v_i \otimes w_i$, where $u_i$, $v_i$ and $w_i$ are vectors in $\mathbb{R}^n$. This decomposition is known as the CANDECOMP/PARAFAC (CP) decomposition of a tensor. Rank of a tensor denotes the CP rank in the rest of this paper.
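As a small illustration of the CP form above, the following NumPy sketch builds a dense tensor from factor matrices whose columns play the role of $u_i$, $v_i$, $w_i$. The helper name `cp_tensor` and the sizes are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cp_tensor(U, V, W):
    """Dense tensor T with T[i, j, k] = sum_l U[i, l] * V[j, l] * W[k, l]."""
    return np.einsum("il,jl,kl->ijk", U, V, W)

rng = np.random.default_rng(0)
n, r = 5, 2  # illustrative sizes only
U, V, W = (rng.standard_normal((n, r)) for _ in range(3))
T = cp_tensor(U, V, W)

# Check one entry directly against the definition of the CP decomposition.
entry = sum(U[1, l] * V[2, l] * W[3, l] for l in range(r))
```

Forming the dense tensor costs $O(n^3 r)$ time and $O(n^3)$ memory, which is exactly the cost the sampling approach of this paper tries to avoid.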

Note that unlike matrices, the components $u_i$ need not all be orthogonal. Surprisingly, tensors have a unique decomposition under simple conditions on the Kruskal rank of the factors $u_i$ (Kolda & Bader, 2009). This makes tensor factorization appealing in latent variable learning in many applications like LDA, HMM, Gaussian mixture models and ICA (Anandkumar et al., 2014a, 2012).


In general, finding the factorization or even just the rank of a tensor is NP-hard (Hillar & Lim, 2013). However, if the tensor has an orthogonal factorization then the factors can be computed using the tensor power method (De Lathauwer et al., 2000; Anandkumar et al., 2014a). Recently, Anandkumar et al. (2014b) gave guarantees on factoring a tensor with incoherent (low inner product) factors. Richard & Montanari (2014) analyzed various algorithms for recovering underlying factors from a spiked statistical model. Note that these algorithms need to access the entire data over multiple iterations.
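To make the remark about repeated data access concrete, here is a minimal sketch of the plain (non-robust) tensor power iteration on a dense symmetric orthogonal tensor. It is not the robust variant analyzed in the cited work; the function names, sizes and iteration count are assumptions for illustration. Every iteration touches all $n^3$ entries.

```python
import numpy as np

def tensor_apply(T, u, v):
    """T(I, u, v): vector whose i-th entry is sum_{j,k} T[i, j, k] * u[j] * v[k]."""
    return np.einsum("ijk,j,k->i", T, u, v)

def power_iteration(T, iters=50, seed=0):
    """Plain tensor power iteration from a random unit start vector."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = tensor_apply(T, u, u)  # reads every entry of T
        u /= np.linalg.norm(u)
    sigma = float(np.einsum("ijk,i,j,k->", T, u, u, u))
    return sigma, u

# Synthetic orthogonal symmetric tensor: sum_l sig[l] * q_l (x) q_l (x) q_l.
n, r = 6, 3
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, r)))
sig = np.array([3.0, 2.0, 1.0])
T = np.einsum("l,il,jl,kl->ijk", sig, Q, Q, Q)
sigma, u = power_iteration(T)
l_hat = int(np.argmax(np.abs(Q.T @ u)))  # which factor the iteration found
```

Which factor is found depends on the random start; recovering all factors requires deflation and restarts, which this sketch omits.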

In situations where the tensor is computed as a higher order moment from the samples, one can use the sample covariance matrix to perform whitening and convert the tensor factors into orthogonal factors (Anandkumar et al., 2014a). This also reduces the problem dimension and one can compute the factors fast using the tensor power method. However this technique cannot be used in settings where one observes the entries of the tensor directly, like ratings in a user × movie × time tensor, or EEG signals measuring electrical activity in the brain as a time × spectral × space tensor. Our algorithm in Section 5 computes factors fast by sampling few entries of the tensor and doing tensor completion.

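Another popular factorization of a tensor is the Tucker decomposition. Here we express the tensor as a product of 3 orthogonal matrices $U \in \mathbb{R}^{n \times r_1}$, $V \in \mathbb{R}^{n \times r_2}$, $W \in \mathbb{R}^{n \times r_3}$ and a core tensor $A \in \mathbb{R}^{r_1 \times r_2 \times r_3}$, i.e., $T_{ijk} = \sum_{pqr} U_{ip} V_{jq} W_{kr} A_{pqr}$. For a detailed discussion and algorithms we refer to (Kolda & Bader, 2009).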
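The Tucker form can be written as a single contraction. The sketch below (helper name `tucker_tensor` and sizes are illustrative assumptions) builds $T$ from a core and three matrices with orthonormal columns, and then recovers the core by projecting back.

```python
import numpy as np

def tucker_tensor(A, U, V, W):
    """T[i, j, k] = sum_{p,q,r} U[i, p] * V[j, q] * W[k, r] * A[p, q, r]."""
    return np.einsum("pqr,ip,jq,kr->ijk", A, U, V, W)

rng = np.random.default_rng(0)
n, r1, r2, r3 = 6, 2, 3, 2  # illustrative sizes only
# Matrices with orthonormal columns, obtained from a QR factorization.
U = np.linalg.qr(rng.standard_normal((n, r1)))[0]
V = np.linalg.qr(rng.standard_normal((n, r2)))[0]
W = np.linalg.qr(rng.standard_normal((n, r3)))[0]
A = rng.standard_normal((r1, r2, r3))

T = tucker_tensor(A, U, V, W)
# Orthonormal columns mean that contracting T with U, V, W returns the core.
A_back = np.einsum("ijk,ip,jq,kr->pqr", T, U, V, W)
```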


Tensor Sparsification: The goal in the tensor sparsification problem is to compute a sparse sketch of a tensor. Tsourakakis (2011) gave a way to compute an approximate factorization of a tensor, but the approximation is in Frobenius norm. Nguyen et al. (2010) proposed a randomized sampling technique to compute a sparse approximation. Specifically, they sample entries with probability proportional to entry magnitude squared. They have given approximation guarantees in spectral
norm with 0{- —samples. We present similar guarantees in the setting where tensor is not 
already computed but one has access to the sample data (Section [3]). 


Tensor Completion: In the tensor completion problem one wants to recover a low-rank tensor from seeing only a few entries of the tensor. Multiple algorithms have been proposed for tensor completion without guarantees, based on weighted least squares (Acar et al., 2011), trace norm minimization (Liu et al., 2013) and alternating least squares (Walczak & Massart, 2001). Mu et al. (2014) and Tomioka et al. (2011) proposed various equivalents of the nuclear norm for tensors and studied the problem of tensor completion, but under Gaussian linear measurements, different from the setting considered here, where one sees the entries of the tensor.


Jain & Oh (2014) have recently shown that one can recover a $\mu$-incoherent rank-$r$ orthogonal tensor from observing only $O(n^{3/2} (\log n)^4 \log(r\|T\|_F/\epsilon))$ random entries. The authors use an alternating minimization style algorithm to recover the factors of the tensor. In Section 4 we show exact recovery for any orthogonal tensor without incoherence assumptions from fewer samples, if sampled appropriately. Another recent work by Barak & Moitra (2015) gives a sum-of-squares hierarchy based algorithm for prediction of incoherent tensors. However they only give approximation guarantees and not exact recovery. Interestingly their techniques work with general incoherent tensors and do not need orthogonality between factors.


Matrix Completion: In low rank matrix completion one wants to recover a low rank matrix from seeing only a few entries of the matrix. This is a well studied problem, starting with the work of (Candes & Recht, 2009; Candes & Tao, 2010) and later (Recht, 2009; Gross, 2011) using the nuclear norm^2 minimization algorithm. Other popular algorithms which guarantee exact recovery are OptSpace (Keshavan et al., 2010) and alternating minimization (Jain et al., 2013; Hardt, 2013). These results assume that the underlying matrix is incoherent and the entries are sampled uniformly at random. Chen et al. (2014) have given guarantees for recovery of any rank-$r$, $n \times n$ matrix from $O(nr \log^2(n))$ samples if sampled according to the leverage score distribution^3.


Matrix Approximation/Sparsification: This is another active area with a large body of interesting literature. Given a matrix $M$, the goal is to produce a low dimensional approximation (sketch) of the matrix with good approximation guarantees and in a small number of passes over the data. The sketch can be a sparse matrix or a low rank matrix. Given the amount of literature we will not be able to do justice to all the works, and we direct the interested reader to the survey articles (Halko et al., 2011; Mahoney, 2011; Woodruff, 2014).


Directly relevant to our 2-pass tensor factorization results (Section 5) are the entrywise sampling results of (Achlioptas & McSherry, 2001; Drineas & Zouzias, 2011; Achlioptas et al., 2013; Bhojanapalli et al., 2015) for matrices. In particular, Achlioptas & McSherry (2001) proposed an entrywise sampling and quantization method for low rank approximation and gave additive error bounds. Bhojanapalli et al. (2015) presented a low rank approximation algorithm using leverage score sampling. Our work is similar in spirit to these results for matrices, but the techniques used for matrices, like the matrix Bernstein inequality, do not extend to tensors and newer techniques are needed.

^2 Nuclear norm of a matrix is the sum of its singular values.

^3 Let the SVD of $M$ be $U \Sigma V^T$; then $p_{ij} \propto \|U^i\|^2 + \|V^j\|^2$ is the leverage score distribution.


Notation: Capital letter $U$ typically denotes a matrix and calligraphic letter $T$ denotes a tensor. $U^i$ denotes the $i$-th row of $U$, $U_j$ denotes the $j$-th column of $U$, and $U_{ij}$ denotes the $(i,j)$-th element of $U$. Unless specified otherwise, $U \in \mathbb{R}^{n \times r}$ and $T \in \mathbb{R}^{n \times n \times n}$. $T_{ijk}$ denotes the $(i,j,k)$ element of the tensor. $\|x\|$ denotes the L2 norm of a vector. $\|M\| = \max_{\|x\|=1} \|Mx\|$ denotes the spectral or operator norm of $M$. $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ denotes the Frobenius norm of $M$.

Now define the tensor operation on a vector $\theta \in \mathbb{R}^n$ as follows,

$T(I, \theta, \theta) = \sum_{i} \Big( \sum_{jk} T_{ijk} \theta_j \theta_k \Big) e_i$,

where $I$ is the $n \times n$ identity matrix and $e_i$ is the $i$-th standard basis vector. The spectral norm of a tensor $T$ is defined as follows:

$\|T\| = \max_{\|u\| = \|v\| = 1} \|T(I, u, v)\|$.

$\|T\|_F = \sqrt{\sum_{ijk} T_{ijk}^2}$ denotes the Frobenius norm of $T$.

$\Omega \subseteq [n] \times [n] \times [n]$ usually denotes the sampled set. $P_\Omega(T)$ is given by: $P_\Omega(T)_{ijk} = T_{ijk}$ if $(i,j,k) \in \Omega$ and 0 otherwise. $R_\Omega(T) = W .* P_\Omega(T)$ denotes the Hadamard product of $W$ and $P_\Omega(T)$. Similarly let $R_\Omega^{1/2}(T)_{ijk} = \sqrt{W_{ijk}}\, T_{ijk}$ if $(i,j,k) \in \Omega$ and 0 otherwise. $C$ is a constant independent of other parameters of the tensor and can change from line to line.

$T_{i,:,:}$ denotes the $n \times n$ matrix with entry $(j,k)$ being $T_{ijk}$. Finally $[r]$ denotes the set of integers from 1 to $r$.
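The operators in this notation map directly onto array operations. The following sketch, with hypothetical helper names and a tiny hand-made example, assumes the sampled set is stored as a boolean mask and the weights $W$ as a dense array; both are conveniences for illustration only.

```python
import numpy as np

def apply_vec(T, theta):
    """T(I, theta, theta): i-th entry is sum_{j,k} T[i, j, k] * theta[j] * theta[k]."""
    return np.einsum("ijk,j,k->i", T, theta, theta)

def frob(T):
    """Frobenius norm of a tensor."""
    return np.sqrt(np.sum(T ** 2))

def P_omega(T, mask):
    """Keep sampled entries, zero out the rest."""
    return np.where(mask, T, 0.0)

def R_omega(T, mask, W):
    """Hadamard product of the weights W with P_omega(T)."""
    return W * P_omega(T, mask)

T = np.arange(8, dtype=float).reshape(2, 2, 2)
mask = np.zeros((2, 2, 2), dtype=bool)
mask[0, 1, 1] = True
mask[1, 0, 0] = True
W = np.full((2, 2, 2), 2.0)
theta = np.array([1.0, -1.0])
```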


3 Direct Tensor Sparsification from Samples 

In this section we present a new two-pass algorithm for computing a sparse approximation of a tensor $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$, where $X_i$ are sample vectors in $\mathbb{R}^n$. Our algorithm involves first computing a specific distribution from $X$ and then sampling entries of the tensor according to this distribution. Note that our algorithm does not need to form the complete tensor from the samples $\{X_i\}$, but only computes the few entries of the tensor that are sampled. Let $X$ be the sample matrix with $X_i$ as columns. Now we present the algorithm in detail.


Algorithm: {Input: data $X$ and sparsity $m$; Output: $O(m)$-sparse tensor $R_\Omega(T)$.}

• In one pass over the data $X$, compute $\|X^i\|$, $\forall i$.

• Generate the sample set $\Omega$, where $(i,j,k) \in \Omega$ with probability $\hat{p}_{ijk} = \min\{m \cdot p_{ijk}, 1\}$, where

$p_{ijk} = \frac{\|X^i\|^{3/2}\|X^j\|^{3/2} + \|X^j\|^{3/2}\|X^k\|^{3/2} + \|X^k\|^{3/2}\|X^i\|^{3/2}}{3n\|X\|_{3/2}^{3}}$, (1)

and $\|X\|_{3/2}^{3/2} = \sum_{i} \|X^i\|^{3/2}$.

• In one more pass over the data compute the tensor elements,

$R_\Omega(T)_{ijk} = \frac{1}{\hat{p}_{ijk}} \Big( \sum_{l=1}^{p} X_{il} X_{jl} X_{kl} \Big)$, $\forall (i,j,k) \in \Omega$, and 0 else.

The output of the algorithm is the sparse tensor $R_\Omega(T)$, where $R_\Omega(T)_{ijk} = T_{ijk}/\hat{p}_{ijk}$ if the entry is sampled and 0 else. Now we will show that the sampled and reweighted tensor $R_\Omega(T)$ is a good approximation to $T$ in spectral norm.
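The two passes can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the sampling probabilities are built from products of row norms of $X$ raised to the power 3/2 and normalized to sum to one, and it materializes the full $n \times n \times n$ probability array, which is only sensible for small $n$. The function name and sizes are hypothetical.

```python
import numpy as np

def sparsify_from_samples(X, m, seed=0):
    """X is n x p with samples as columns; returns sampled indices and reweighted values."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pass 1: row norms of X, raised to the (assumed) power 3/2.
    a = np.linalg.norm(X, axis=1) ** 1.5
    ai, aj, ak = a[:, None, None], a[None, :, None], a[None, None, :]
    p = (ai * aj + aj * ak + ak * ai) / (3 * n * a.sum() ** 2)
    p_hat = np.minimum(m * p, 1.0)
    # Independent Bernoulli draw per entry (dense here purely for illustration).
    mask = rng.random((n, n, n)) < p_hat
    i, j, k = np.nonzero(mask)
    # Pass 2: compute only the sampled entries sum_l X[i,l] X[j,l] X[k,l], reweighted.
    vals = np.einsum("al,al,al->a", X[i], X[j], X[k]) / p_hat[i, j, k]
    return (i, j, k), vals, p, p_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 5))
idx, vals, p, p_hat = sparsify_from_samples(X, m=200)
T = np.einsum("il,jl,kl->ijk", X, X, X)  # dense tensor, formed only to check the sketch
```

A practical implementation would draw the index triples directly from the row-norm marginals instead of testing all $n^3$ positions.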

Theorem 3.1. Given sample vectors $X_i \in \mathbb{R}^n$, let $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$. Then the sampled and reweighted tensor $R_\Omega(T)$ generated according to the distribution (1) satisfies the following:

||/2o(r)-r||<e|^V^*^||X*f^, (2) 

with probability >1 — for m > 0 {-— 

Remarks: 


1. The expected number of sampled entries is $\le m$. Hence the sparsity of the sampled tensor $R_\Omega(T)$ is less than $2m$ with high probability, by concentration of binomial random variables.

2. The proof of this theorem is discussed in Appendix B and relies mainly on appropriately partitioning the set of $(i,j,k)$ and bounding the error on each partition using the concentration bounds for the spectral norm of a random tensor (Theorem A.3 (Nguyen et al., 2010)).


3. 


This theorem generalizes to an order-d tensor T = with distribution 


Ph.-.u (d- llxillJ/C-l-ijjJ-i 

and sample complexity m > log^(n)). 


4. We now show that approximating the tensor in spectral norm gives a constant approximation to the underlying factors if the tensor has orthogonal factors, using the Robust Tensor Power Method (RTPM) (Anandkumar et al., 2014a). Intuitively good approximation is possible if the sample
vectors do not cancel in adversarial way i.e., ^/n* is of the same order as crT^. Such 

an approximation to the factors is desirable for initialization of algorithms like the tensor power method or alternating least squares, as we will discuss in Section 4. The result follows from Theorem 5.1 of (Anandkumar et al., 2014a).


Lemma 3.2. Given a tensor T = orthogonal factors s.t. T = ® 

U* (g) U* and €y/n* ^ then, running O(clogr) iterations of RTPM on Rn{T) 
sampled according to distribution (1) gives $U_l$ and $\sigma_l$ satisfying the following.


\\ui-un\2<c€ 




CFT 




< Ce^/n* (^11^ 


i||3 


vi=l 


( 3 ) 

( 4 ) 


for all I G [r] with probability > 1 — for m > 0{-—Qjid constant C. 

Computation complexity: Computing the sampling distribution needs $O(\mathrm{nnz}(X))$ time (the sparsity of $X$). Computing $m$ entries of the tensor from the distribution takes $O(p\, m \log(n))$ time. Note that both these steps can be implemented in two passes over the data matrix $X$. In contrast, just computing the tensor from the samples takes $O(pn^3)$ time. Further, one of the most frequently performed operations with tensors, the tensor-vector product $R_\Omega(T)(I,v,v)$, takes only $O(m)$ time, independent of $p$, compared to the $O(np)$ complexity of computing $T(I,v,v)$.
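The $O(m)$ cost of the tensor-vector product on a sparse tensor can be seen in a short sketch that stores the sampled tensor in coordinate form. The storage layout and the helper name `sparse_apply` are assumptions made for illustration.

```python
import numpy as np

def sparse_apply(idx, vals, v, n):
    """Compute S(I, v, v) for a sparse tensor S given as (indices, values); one pass over nonzeros."""
    i, j, k = idx
    out = np.zeros(n)
    # Accumulate vals * v[j] * v[k] into position i; repeated i are summed.
    np.add.at(out, i, vals * v[j] * v[k])
    return out

rng = np.random.default_rng(0)
n = 6
dense = rng.standard_normal((n, n, n))
mask = rng.random((n, n, n)) < 0.2
idx = np.nonzero(mask)
vals = dense[idx]
v = rng.standard_normal(n)

result = sparse_apply(idx, vals, v, n)
reference = np.einsum("ijk,j,k->i", np.where(mask, dense, 0.0), v, v)
```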


4 Exact Tensor Completion 

In this section we present our main result on the tensor completion problem. Let $T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^*$, where $U^* \in \mathbb{R}^{n \times r}$ is an orthonormal matrix. The tensor completion problem is to recover the rank-$r$ tensor from observing only a few entries. We will show that for any rank-$r$
tensor if entries are sampled according to a specific distribution, it can be recovered exactly from 
less than log^(re)) samples using algorithm [TJ 


Sampling: Now we will describe the sampling distribution that is sufficient to show exact recovery 
of any low rank orthogonal tensor from less than log(n)) samples. 


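$p_{ijk} = \frac{\|(U^*)^i\|^{3/2}\|(U^*)^j\|^{3/2} + \|(U^*)^j\|^{3/2}\|(U^*)^k\|^{3/2} + \|(U^*)^k\|^{3/2}\|(U^*)^i\|^{3/2}}{3n\|U^*\|_{3/2}^{3}}$, (5)

where $\|U^*\|_{3/2}^{3/2} = \sum_{i} \|(U^*)^i\|^{3/2}$.

Let $m$ be the number of samples; then element $T_{ijk}$ is sampled independently with probability at least $\hat{p}_{ijk} = \min\{m \cdot p_{ijk}, 1\}$. The sampling distribution depends on the row norms of $U^*$. We discuss the intuition for this distribution in Section 4.1.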


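Algorithm: For recovery of the tensor factors from the samples we use the alternating least squares algorithm (Algorithm 1). Define the weights $W_{ijk} = 1/\hat{p}_{ijk}$ when $\hat{p}_{ijk} > 0$, and 0 else. The algorithm minimizes the error,

$\min_{U \in \mathbb{R}^{n \times r}} \sum_{ijk \in \Omega} W_{ijk} \Big( T_{ijk} - \sum_{l=1}^{r} \sigma_l U_{il} U_{jl} U_{kl} \Big)^2$,

in an iterative way as discussed below. For detailed pseudocode see Algorithm 1.

{Input: sampled tensor $P_\Omega(T)$, initialization $U$, weights $W$, iterations $b$; Output: completed tensor $\hat{T}$.}

• Compute the sparse residual tensors, $\mathcal{R}_q = P_\Omega(T - \sum_{l \neq q} \sigma_l U_l \otimes U_l \otimes U_l)$, for all $q \in [r]$.

• Compute the updates $\hat{U}_q^{t+1}$ by solving the weighted least squares problems,

$\hat{U}_q^{t+1} = \arg\min_{u \in \mathbb{R}^n} \|R_\Omega^{1/2}(\mathcal{R}_q - u \otimes U_q \otimes U_q)\|_F^2$,

for all $q \in [r]$.

• Set $\sigma_q = \|\hat{U}_q^{t+1}\|$ and $U_q = \hat{U}_q^{t+1}/\sigma_q$, for all $q \in [r]$, and repeat the above steps for $b$ iterations.

Algorithm 1 WALS: Weighted Alternating Least Squares
input $P_\Omega(T)$, $\Omega$, $r$, $P_\Omega(W)$, $b$, $U$
1: Divide $\Omega$ in $r \cdot b$ equal random subsets, i.e., $\Omega = \{\Omega_1, \ldots, \Omega_{r \cdot b}\}$.
2: for $t = 0$ to $b - 1$ do
3:   for $q = 1$ to $r$ do
4:     $\hat{U}_q^{t+1} = \arg\min_{u \in \mathbb{R}^n} \|R_{\Omega_{rt+q}}^{1/2}(T - u \otimes U_q^t \otimes U_q^t - \sum_{l \neq q} \sigma_l^t U_l^t \otimes U_l^t \otimes U_l^t)\|_F^2$.
5:     $\sigma_q^{t+1} = \|\hat{U}_q^{t+1}\|$.
6:     $U_q^{t+1} = \hat{U}_q^{t+1}/\|\hat{U}_q^{t+1}\|$.
7:   end for
8:   $U \leftarrow U^{t+1}$.
9:   $\Sigma \leftarrow \Sigma^{t+1}$.
10: end for
output Completed tensor $\hat{T} = \sum_{l=1}^{r} \sigma_l U_l \otimes U_l \otimes U_l$.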
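Each weighted least squares step above decouples across the coordinates of $u$ and has a closed form. The sketch below shows one such update for a single factor, assuming the residual is stored in coordinate form with one weight per sampled entry; the function name, the storage layout and the rank-1 test tensor are illustrative assumptions, not the authors' code.

```python
import numpy as np

def wals_update(idx, resid, w, b, n):
    """Minimize sum over sampled (i,j,k) of w * (resid - u[i] * b[j] * b[k])^2 over u."""
    i, j, k = idx
    c = b[j] * b[k]
    num = np.zeros(n)
    den = np.zeros(n)
    np.add.at(num, i, w * resid * c)   # per-row weighted correlation with the residual
    np.add.at(den, i, w * c * c)       # per-row weighted energy of b[j] * b[k]
    # Rows with no usable samples are left at zero.
    u_hat = np.divide(num, den, out=np.zeros(n), where=den > 0)
    sigma = np.linalg.norm(u_hat)
    return u_hat, sigma

rng = np.random.default_rng(0)
n, sigma_true = 8, 2.5
b = rng.standard_normal(n)
b /= np.linalg.norm(b)
T = sigma_true * np.einsum("i,j,k->ijk", b, b, b)  # exact rank-1 tensor
mask = rng.random((n, n, n)) < 0.5
idx = np.nonzero(mask)
w = rng.uniform(0.5, 2.0, size=idx[0].size)        # arbitrary positive weights
u_hat, sigma = wals_update(idx, T[idx], w, b, n)
```

On an exactly rank-1 tensor with the true factor supplied as `b`, the update returns $\sigma b$ regardless of the weights; the weights matter once the current factor is only approximate.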

Note that this minimization is fundamentally a non-convex problem, but as we will see (Theorem 4.1), the iterates converge to the global optimum given a sufficient number of samples.

Algorithm 1 needs as input a good initialization, within constant distance of the true factors. As we have seen in the previous section (Lemma 3.2), the factors of the sampled and reweighted tensor $R_\Omega(T)$ satisfy this condition for a big enough number of samples $m$. We also need to threshold each entry of $U$ at $2\|(U^*)^i\|$; we can estimate these values from the samples. In summary, the initialization must be within constant distance of the true factors and satisfy

$|U_{il}| \le 2\|(U^*)^i\| \quad \forall i \in [n],\ l \in [r]$. (6)

Finally we assume that every iteration uses an independent set of samples. This is to avoid dependence between successive iterates in the analysis. However this does not seem to be required in practice, as noticed in the simulations section. Now we present the result about exact recovery of any orthogonal tensor using Algorithm 1.

Theorem 4.1. Let $T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^*$ be a rank-$r$ orthogonal tensor. Let $\Omega$ be generated according to (5). Then the output of Algorithm 1 with initialization satisfying (6) after $b = O(\log(4\sqrt{r}\,\|T\|_F/\epsilon))$ iterations satisfies the following:

$\|T - \hat{T}\| \le \epsilon$, (7)

with probability >1 — ^, for number of samples m > 0((X]j ||(?7*)*|| 2 log^(n) log(||T||ir/e)). 








Remarks:

1. The number of samples needed for exact recovery is $\le 2m$ with high probability (from Binomial concentration). Hence we guarantee exact recovery of any rank-$r$ orthogonal tensor from a number of samples that scales as $\big(\sum_i \|(U^*)^i\|^{3/2}\big)^2 n$, up to factors in $r$ and $\log(n)$. Since the rows of $U^*$ satisfy $\sum_i \|(U^*)^i\|^2 = r$, Hölder's inequality gives $\sum_i \|(U^*)^i\|^{3/2} \le r^{3/4} n^{1/4}$, so this is at most of order $n^{1.5}$ in $n$. So for tensors with biased factors, where $\sum_i \|(U^*)^i\|^{3/2}$ is a constant, we need only $O(n)$ samples for exact recovery. The worst case is when the factors are incoherent, and our sample complexity $O(n^{1.5})$ matches that of (Jain & Oh, 2014). As far as we know, this is the first result to guarantee exact recovery of an arbitrary orthogonal tensor and to characterize the sample complexity for higher order tensors.


2 . The theorem generalizes to an order-d tensor T S M”®"* with distribution 

and sample complexity m > ©((XlILi log^(n) logdiTlIiT’/e)). 


3. Algorithm 1 maintains only the factors of the tensor and the samples in each iteration, so it needs only $O(nr + m)$ memory. Further, since each iteration involves solving a weighted least squares problem, the computational complexity of the algorithm is $O((mr + m + n)\, r \log(\|T\|_F/\epsilon))$, which is just $O(m r^2 \log(\|T\|_F/\epsilon))$. Hence this algorithm has low computational complexity and, further, each iteration can be easily parallelized.


4. The proof of Theorem 4.1, similar to the proof technique of (Jain & Oh, 2014), involves showing that the distance of the factors in the current iterate to the optimum decreases in each iteration (Lemma C.2). However our sampling distribution is not uniform and the underlying tensor is not incoherent, so we have to carefully use the properties of the distribution (5) to show convergence for arbitrary factors. The complete proof is presented in Appendix C.


4.1 Discussion 

Now we discuss the intuition for the sampling distribution (5), which is important to guarantee exact recovery. The key idea is that distributions like L1, L2 and (8) do not sample enough entries corresponding to biased factors, while some distributions like (9) do not sample enough entries corresponding to the unbiased factors. The proposed distribution (5) achieves the right balance.

Clearly with the uniform distribution one cannot guarantee exact recovery unless one samples all the entries (for example, consider a rank-1 tensor which has a single non-zero entry). Now consider data dependent distributions where the probability of sampling an entry is proportional to the magnitude of the entry (L1) or its magnitude squared (L2). We now present a counterexample for these distributions.

Claim 4.2. There exists a rank-2 tensor for which, when sampling with the L1 or L2 distributions, the error is bounded away from zero for number of samples $m < \frac{n^3}{\log^3(n)}$, w.h.p.
Proof. Consider a rank-2 block diagonal tensor with the first block of size $\log^3(n)$ of all ones and the second block of size $(n - \log(n))^3$ of all ones. The factors of this tensor are

$u_1 = \Big[\tfrac{1}{\sqrt{\log(n)}}, \ldots, \tfrac{1}{\sqrt{\log(n)}}, 0, \ldots, 0\Big]^T$ and $u_2 = \Big[0, \ldots, 0, \tfrac{1}{\sqrt{n - \log(n)}}, \ldots, \tfrac{1}{\sqrt{n - \log(n)}}\Big]^T$,

where $u_1$ has $\log(n)$ non-zero entries.

Now with L1 sampling the expected number of entries seen in the first block is $\sim m \cdot \frac{\log^3(n)}{n^3}$, which is less than 1 for $m < \frac{n^3}{\log^3(n)}$. Similarly L2 sampling also fails to sample the first block. Hence the error is bounded away from zero.

For the proposed sampling (5), the expected number of entries sampled in the first block is $\sim m \cdot \frac{\log^{1.5}(n)}{n^{1.5}}$. Hence the complete block is sampled for $m \ge O(n^{1.5} \log^{1.5}(n))$. □
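The contrast in this example can be checked numerically. The sketch below is a back-of-the-envelope calculation under explicit assumptions: the small block has side $\lfloor \log n \rceil$, magnitude-based sampling on an all-ones tensor is treated as uniform over the non-zero entries, and the proposed distribution is taken to be the product of row norms to the power 3/2, normalized to sum to one. The function name and the chosen $n$, $m$ are illustrative.

```python
import numpy as np

def expected_first_block(n, m):
    """Expected number of sampled entries in the small block: (magnitude-based, proposed)."""
    L = int(round(np.log(n)))  # side of the small block (assumed rounding)
    # Row norms of U*: 1/sqrt(L) on the small block, 1/sqrt(n - L) elsewhere.
    a = np.concatenate([np.full(L, L ** -0.5), np.full(n - L, (n - L) ** -0.5)])
    # All non-zero entries equal one, so L1 and L2 sampling are uniform over them.
    magnitude_based = m * L**3 / (L**3 + (n - L)**3)
    # Assumed proposed distribution: three equal product terms on a small-block entry.
    s = np.sum(a ** 1.5)
    p_entry = 3 * a[0] ** 3 / (3 * n * s ** 2)
    proposed = L**3 * min(m * p_entry, 1.0)
    return magnitude_based, proposed

magnitude_based, proposed = expected_first_block(n=10**4, m=10**6)
```

With these values the magnitude-based schemes expect far less than one entry from the small block, while the row-norm based scheme expects several.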


Now consider more biased distributions. 

||(C/*)i3 + ||([/*)i||3 + ||([/*)fc||3 

Pijk — 


and 


Pijk 


MEiWmT? 


*\i||3 


Claim 4.3. There exists a rank-2 tensor for which sampling with distributions (8) or (9), error is bounded away from zero, for number of samples m < n^/log^(n), w.h.p.


Proof. Consider the same example as in Claim 4.2, the rank-2 block diagonal tensor with the first block of size $\log^3(n)$ of all ones and the second block of size $(n - \log(n))^3$ of all ones. The factors of this tensor are

$u_1 = \Big[\tfrac{1}{\sqrt{\log(n)}}, \ldots, \tfrac{1}{\sqrt{\log(n)}}, 0, \ldots, 0\Big]^T$ and $u_2 = \Big[0, \ldots, 0, \tfrac{1}{\sqrt{n - \log(n)}}, \ldots, \tfrac{1}{\sqrt{n - \log(n)}}\Big]^T$.

Now with distribution ([8|), expected number of entries sampled in the first block is ~ m * 
„2 25 iJgi-5(n) • I^^'^ce for m < first block is not sampled w.h.p. and the error is bounded away 
from zero. 

Now consider the distribution ([9]), expected number of entries sampled in the second block is 
~ 7 TT, * iHS—hd _ Hence for m < n^/log^(n), number of entries sampled is strictly less than n —log(n). 
Since second block has n — log(n) faces, atleast one face of the tensor is not sampled along each 
dimension and hence cannot be recovered. 

□ 


5 2-pass Approximate Tensor Factorization 

In this section we present a new algorithm to compute the factors of an orthogonal tensor corrupted by noise. Our algorithm needs only two passes over the data, unlike the existing algorithms which need to access the data over multiple iterations. Let $T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^* + \mathcal{E}$, where $U^* \in \mathbb{R}^{n \times r}$ is an orthonormal matrix and $\mathcal{E}$ is arbitrary but bounded noise. We specifically make the following assumptions on the noise.

IlfllF < and ||£||» < 1^, (10) 

where $\|\mathcal{E}\|_\infty$ is $\max_{ijk} |\mathcal{E}_{ijk}|$.

Now we describe the algorithm we use to compute the factors of the tensor $T$. The algorithm consists of two parts. First we sample the entries of $T$ according to a specific biased distribution and then use Algorithm 1 with the sampled entries to compute the factors.

Sampling: Now we describe the distribution used to sample the tensor. Note that unlike in the previous section, the tensor is not exactly low rank and so the distribution is modified to account for the noise. Consider the following distribution, which can be computed easily in one pass over the tensor.


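$p_{ijk} = 0.5\, \frac{v_i^{3/2} v_j^{3/2} + v_j^{3/2} v_k^{3/2} + v_k^{3/2} v_i^{3/2}}{3nZ} + 0.5\, \frac{T_{ijk}^2}{\|T\|_F^2}$, (11)

where $v_i$ is computed from the face norm $\|T_{i,:,:}\|_F$ and $Z$ is the normalizing constant. $\|T_{i,:,:}\|_F$ is the Frobenius norm of the $i$-th face of the tensor and $\|T_{i,:,:}\|_F^2 = \sum_{jk} T_{ijk}^2$. Note that we use $v_i$ as an estimate for $\|(U^*)^i\|$.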

We compute factors $U$ of the sampled tensor $R_\Omega(T)$ using RTPM and use them as initialization for the second step of the algorithm. Note that we also threshold the factors such that $U_{il} \le 2v_i$.
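A mixture of a row-based term and an entrywise squared-magnitude term, as in the distribution of this section, can be sketched as follows. The exact definition of $v_i$ is treated here as an assumption (normalized Frobenius norms of the faces), as is the form of the row-based term (products of $v^{3/2}$ normalized to sum to one); the function name is hypothetical and the dense arrays are for illustration on small tensors only.

```python
import numpy as np

def mixed_distribution(T, v):
    """Equal-weight mixture of a row-based term and an entrywise squared-magnitude term."""
    n = T.shape[0]
    a = v ** 1.5
    Z = a.sum() ** 2  # normalizer that makes the row-based term sum to one
    ai, aj, ak = a[:, None, None], a[None, :, None], a[None, None, :]
    row_term = (ai * aj + aj * ak + ak * ai) / (3 * n * Z)
    entry_term = T ** 2 / np.sum(T ** 2)
    return 0.5 * row_term + 0.5 * entry_term

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 6, 6))
# Assumed estimate of the row norms: normalized Frobenius norms of the faces T[i, :, :].
v = np.sqrt(np.sum(T ** 2, axis=(1, 2)) / np.sum(T ** 2))
p = mixed_distribution(T, v)
```

Both the face norms and $\|T\|_F$ can be accumulated in a single pass over the tensor, which is what makes a two-pass algorithm possible.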


WALS: The second part of the algorithm uses the samples from the first part and computes the factors using the WALS algorithm (Algorithm 1). The intuition is that if the underlying tensor were exactly rank-$r$, then this reduces to the completion setting discussed in the previous section and Algorithm 1 would indeed recover the underlying factors. Since the tensor is not exactly rank-$r$, the noise introduces an error in the recovered factors.

The pseudocode of the algorithm is given in Algorithm 2. Now we present the main recovery result.

Theorem 5.1. Given a tensor $T = \sum_{i=1}^{r} \sigma_i^* U_i^* \otimes U_i^* \otimes U_i^* + \mathcal{E}$, where $U^* \in \mathbb{R}^{n \times r}$ is an orthonormal matrix and $\mathcal{E}$ satisfies assumption (10), the output of Algorithm 2 satisfies the following:


\\Ui-Ut 


< 


12||T|| 


cr; 


||T||f ^fz 
-I- e—^— * 


O': 


n 


0.25 ■ 


yi G 


and 


\\f-T\\<A8rK\\£\\+e\\£\\F*^, 

with probability > 1 — ;^, /or m > log^(n) log(4-y/r||r||r’/e||T||r’)). 

Algorithm 2 Approximate tensor factorization
input Tensor $T$, number of samples $m$, rank $r$, iterations $b$.
1: In one pass over the data compute $v_i$ for all $i$.
2: Compute samples $P_\Omega(T)$ from the tensor according to distribution (11) in one pass over the data.
3: Compute factors $U$ using the robust tensor power method from $R_\Omega(T)$.
4: Threshold $U_{ij}$ at $2v_i$, $\forall (i,j)$.
5: $\{U_i, \sigma_i\}_{i \in [r]}$ = WALS($P_\Omega(T)$, $\Omega$, $r$, $P_\Omega(1/\hat{p})$, $b$, $U$).
output $\{U_i, \sigma_i\}_{i \in [r]}$.

Remarks:


1. Note that Z = ^ Hence ^ < 2. So for a tensor concentrated in few 

entries, Z can be as small as a constant and hence the error is smaller for such tensors. 


2. In the error expression above, the first term, $O(\|\mathcal{E}\|)$, arises even for algorithms that access the complete tensor (Anandkumar et al., 2014a). The second term in the error, $O(\epsilon\|\mathcal{E}\|_F)$, is the approximation error and decreases with increasing number of samples $m$.


3. The assumptions on the noise dlOp are satisfied by entrywise random Gaussian noise. Let £ be 
a random tensor with ~ AA(0, 11^11°° - ’^ith 

high probability and £ satisfies (fTOl) . 

Computation and memory: Algorithm 2 has a complexity of $O(\mathrm{nnz}(T) + m r^2)$, as the sampling step takes $O(\mathrm{nnz}(T))$ time and Algorithm 1 takes $O(m r^2)$ time. Hence by Theorem 5.1 the
complexity becomes 0{nnz{T) + log^(n) log(||T||i?/e)). Further the sampling part of 

algorithm needs to read the data and store only $O(n)$ numbers corresponding to distribution (11), and the WALS step needs only $O(m + nr)$ memory in each iteration.


6 Simulations 

In this section we present some simulation results comparing the proposed sampling technique to 
other distributions on synthetic examples. First we will present results for tensor sparsification 
followed by tensor completion and approximate factorization. 


Tensor sparsification: We now discuss the parameters of the simulations. We construct symmetric 100 × 100 × 100 order-3 tensors. We generate $p$ random unit vectors $X_i$ and the corresponding tensor $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$. We plot the error with increasing number of samples $m$. Note that computing the spectral norm of a tensor is NP-hard (Hillar & Lim, 2013). Hence we use the following approximation of the spectral norm as the error measure: $\|T\|_{2,2} = \sqrt{\sum_i \|T_{i,:,:}\|^2}$, which is the 2-norm of the spectral norms of the faces of the tensor. Note that since the tensor is symmetric we can consider faces along any dimension.
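The error measure described above is straightforward to compute. A minimal sketch, assuming faces are taken along the first index and using a hypothetical helper name and a hand-made test tensor:

```python
import numpy as np

def norm_22(T):
    """2-norm of the vector of spectral norms of the faces T[i, :, :]."""
    face_norms = np.linalg.norm(T, ord=2, axis=(1, 2))  # largest singular value per face
    return np.sqrt(np.sum(face_norms ** 2))

# Two diagonal faces with spectral norms 3 and 4.
T = np.zeros((2, 2, 2))
T[0] = np.diag([3.0, 1.0])
T[1] = np.diag([4.0, 2.0])
value = norm_22(T)
```

Each face norm requires a singular value computation on an $n \times n$ matrix, so the measure costs $n$ such computations instead of an intractable maximization over pairs of unit vectors.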

We compare the error performance with the following distributions: uniform; L2: $p_{ijk} \propto T_{ijk}^2$;
Sum L3: $p_{ijk} \propto \|X^i\|^3 + \|X^j\|^3 + \|X^k\|^3$; and the proposed distribution Tensor L.S., $p_{ijk}$ given by (5).

In Figure 1 we compare the performance of various sampling distributions as we increase the
Figure 1: In this plot we compare the error of various sampling distributions used to sample a random
tensor $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$, as we increase the number of sampled entries. Since we
cannot compute the spectral norm of the error tensor, we compute the $L_{2,2}$ norm of the error. (a): In
the first plot we consider a tensor formed from random vectors $X_i$. For such tensors we notice that
most sampling distributions, including uniform, work well. (b): In this plot we create the tensor from
biased factors $D * X_i$, where $D$ is a diagonal matrix with parameter $\alpha = 0.5$. In this case we notice
that the proposed sampling distribution achieves smaller error compared to other distributions.

number of samples. For this plot we create the tensor from random samples, $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$ with
$p = 50$. For plot 1(a) we generate $X_i$ from random Gaussian vectors. For plot 1(b) we bias $X_i$
according to a power law with a diagonal matrix $D$ and use $D * X_i$. We set $\alpha = 0.5$. This
generates tensors concentrated in fewer elements, and hence uniform sampling incurs more
error. We see that the proposed sampling distribution has the smallest error as we increase the
number of samples.
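
The construction of the test tensors can be sketched as follows. This is a minimal illustration under stated assumptions: the power-law form $D_{ii} = i^{-\alpha}$, the renormalization of the biased vectors to unit length, and the helper name `random_symmetric_tensor` are all assumptions, since the exact definition of $D$ is not recoverable from the text. Small sizes are used to keep the example fast.

```python
import numpy as np

def random_symmetric_tensor(n, p, alpha=0.0, seed=0):
    """Return T = sum_l x_l (x) x_l (x) x_l for p unit vectors x_l in R^n.

    alpha > 0 biases the vectors with a diagonal matrix D before normalizing.
    ASSUMPTION: D_ii = i**(-alpha); the paper's exact D is not known here.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                   # columns are Gaussian vectors
    D = np.arange(1, n + 1, dtype=float) ** (-alpha)  # assumed power-law diagonal
    X = D[:, None] * X                                # D * X: row i scaled by D_ii
    X /= np.linalg.norm(X, axis=0)                    # ASSUMPTION: renormalize columns
    return np.einsum("il,jl,kl->ijk", X, X, X)

T_flat = random_symmetric_tensor(n=20, p=5, alpha=0.0)
T_biased = random_symmetric_tensor(n=20, p=5, alpha=0.5)
```

With `alpha=0.0` the matrix $D$ is the identity, which corresponds to the unbiased setting of plot (a).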

In figure [2] we plot error performance as we increase number of sample vectors p for various 
distributions. We fix the number of entries sampled from the tensor at m = [10 * As we 

increase the number of vectors $p$ the approximation becomes worse and the error increases. Again,
in Figure 2(a) we use a tensor constructed from random vectors $X_i$ and most distributions have similar error.
In Figure 2(b) we consider a tensor constructed from biased random vectors $D * X_i$, with $\alpha = 0.5$, and we
notice that the proposed sampling distribution has smaller error.

Tensor completion: In Figure 3(a) we plot the performance of Algorithm 1. We consider rank-5 orthogonal
tensors $T = \sum_{i} U_i \otimes U_i \otimes U_i$, $U = \mathrm{SVD}(D * X)$, with varying bias $\alpha$, and plot the number
of samples needed for exact recovery under various sampling techniques. We show that the proposed
sampling distribution (Tensor L.S.) needs approximately the same number of samples irrespective of the
bias $\alpha$ of the factors. Other distributions need increasingly more samples for recovery as the bias
of the factors increases.

Tensor factorization: In Figure 3(b) we plot the performance of Algorithm 2. We construct random
orthogonal tensors with noise, $T = \sum_{i} U_i \otimes U_i \otimes U_i + \mathcal{E}$, where $\mathcal{E}$ is an entrywise random Gaussian
tensor. We compute the RMSE of the recovered factors with respect to the true factors and plot this on the y-axis
against the increasing norm of the noise ($\|\mathcal{E}\|_F$) on the x-axis. Again we notice that the proposed sampling
distribution has smaller error compared to other distributions.
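
The noisy orthogonal test tensors and the factor error can be sketched as below. This is an illustration only: the power-law form $D_{ii} = i^{-\alpha}$, the noise scale `0.01`, the RMSE convention (entrywise over factors whose columns are already matched in order), and the helper names are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def orthogonal_factors(n, r, alpha, rng):
    """Left singular vectors of D * X. ASSUMPTION: D_ii = i**(-alpha)."""
    X = rng.standard_normal((n, r))
    D = np.arange(1, n + 1, dtype=float) ** (-alpha)
    U, _, _ = np.linalg.svd(D[:, None] * X, full_matrices=False)
    return U                                   # n x r, orthonormal columns

def factor_rmse(U_hat, U_true):
    """Entrywise RMSE between factor matrices.

    ASSUMPTION: columns of U_hat are already matched to those of U_true.
    """
    return float(np.sqrt(np.mean((U_hat - U_true) ** 2)))

rng = np.random.default_rng(0)
U = orthogonal_factors(n=30, r=5, alpha=0.5, rng=rng)
T_clean = np.einsum("il,jl,kl->ijk", U, U, U)  # sum_l U_l (x) U_l (x) U_l
E = 0.01 * rng.standard_normal(T_clean.shape)  # entrywise Gaussian noise
T_noisy = T_clean + E
noise_fro = float(np.linalg.norm(E))           # Frobenius norm of the noise (x-axis)
```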
Figure 2: In this plot we compare the error performance of various sampling distributions, used to
sample a random tensor, as we increase the number of sample vectors $p$. Note that as we increase
the number of sample vectors $p$ the approximation degrades and the error increases. (a): In the first
plot we again consider a random tensor $T = \sum_{i=1}^{p} X_i \otimes X_i \otimes X_i$, and most sampling distributions,
including uniform, have similar error. (b): In this plot we again create a tensor from biased factors.
In this case we notice that the proposed sampling distribution achieves smaller error compared to
other distributions.




Figure 3: (a): In this plot we compare the number of samples needed for exactly recovering a rank-5
orthogonal tensor from different sampling distributions using Algorithm 1. $T = \sum_{i} U_i \otimes U_i \otimes U_i$;
$U_i$ are orthogonal biased vectors with $U = \mathrm{SVD}(D * X)$, where $X$ is a random matrix and $D$ is a
diagonal matrix whose entries depend on the bias $\alpha$. With increasing values of $\alpha$ (x-axis) the tensor becomes concentrated
on fewer entries. On the y-axis we plot the number of samples needed for successful recovery (RMSE
< 0.01) in more than 80% of runs. The proposed sampling distribution Tensor L.S. is able to recover
the tensor from a smaller number of entries even if the tensor gets biased. (b): In this plot we
consider the performance of Algorithm 2 in the noisy tensor case $T = \sum_{i} U_i \otimes U_i \otimes U_i + \mathcal{E}$, where $\mathcal{E}$
is an entrywise random tensor. We plot the RMSE of the computed factors from Algorithm 2 as
the noise Frobenius norm increases. We notice that the proposed sampling distribution has smaller
error.
A Concentration results


In this section we will review the concentration results we will be using in our proofs. 

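Lemma A.1 (Bernstein's Inequality). Let $X_1, \ldots, X_n$ be independent scalar random variables. Let
$|X_i| \le L, \forall i$ w.p. 1. Then,

$$P\left[\left|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} E[X_i]\right| \ge t\right] \le 2\exp\left(\frac{-t^2/2}{\sum_{i=1}^{n} \mathrm{Var}(X_i) + Lt/3}\right). \quad (12)$$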


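Lemma A.2 (Matrix Bernstein's Inequality (Tropp, 2012)). Let $X_1, \ldots, X_p$ be independent random
matrices in $\mathbb{R}^{n \times n}$. Assume each matrix has bounded deviation from its mean:

$$\|X_i - E[X_i]\| \le L, \ \forall i \text{ w.p. } 1.$$

Also let the variance be

$$\sigma^2 = \max\left\{\left\|E\left[\sum_{i=1}^{p} (X_i - E[X_i])(X_i - E[X_i])^T\right]\right\|, \ \left\|E\left[\sum_{i=1}^{p} (X_i - E[X_i])^T (X_i - E[X_i])\right]\right\|\right\}.$$

Then,

$$P\left[\left\|\sum_{i=1}^{p} (X_i - E[X_i])\right\| \ge t\right] \le 2n\exp\left(\frac{-t^2/2}{\sigma^2 + Lt/3}\right). \quad (13)$$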


Now from (Nguyen et al., 2010) we know the following bound on the spectral norm of a random
tensor.

Theorem A.3. Let T £ order-d tensor and let T be a random tensor of same di¬ 

mensions with independent entries such that E T = T. For any A < ^, and 1 < q < 2ndXln{^), 
then: 


E 


IIT-r||l 


< c8'^\ 2dln 


log2 


d-l I d 

T. 


, (14) 


where 


2 djf 

Oj — 


ma,x 
^1 }■ — 1 + 1 


andl3=Ai.ax 




'Til—id 


B Proofs of Section 3

First we will give certain properties of the sampling distribution. Recall 

_ \\X^f\\X^f + ||XJf ||A^f + llx'^f ||A*f 

’ 

where IAII 3 = Also we will use 6ijk to denote indicator random variable throughout 

the proofs. 
Lemma B.1. Given a tensor $T = \sum_{l=1}^{p} X_l \otimes X_l \otimes X_l$ and distribution $p_{ijk}$ defined in equation (1),

the following holds, 


Pijk 


< n\\X\\l y{i,j,k). 


(15) 


Proof. First using Cauchy-Schwarz inequality we get, 

p 


1=1 






Y,^^ki<X\\\XUxH- 


Also by AM-GM inequality we get, 


||A*f ||AJf -h ||AJf ||X^f + ||A^f ||A*f ||X*|p||XJ||2||X'=||2 

Pijk = -^- > 


3n||X||i 

Hence the first inequality follows from the above two equations 




□ 


Now we will provide the proof of Theorem 3.1. We will use the relation between the tensor $T$
and the probability distribution $p_{ijk}$ through Lemma B.1.


Proof of Theorem 3.1. To bound $\|T - \hat{T}\|$ we will use the concentration Theorem A.3. Let
$\mathcal{H} = T - \hat{T}$. Now if $\hat{p}_{ijk} \ge 1$ then $\mathcal{H}_{ijk} = T_{ijk} - \hat{T}_{ijk} = 0$. Hence we only consider the cases for which
$\hat{p}_{ijk} = m \cdot p_{ijk} \le 1$.
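
The sampling and rescaling step that this argument relies on can be sketched as follows. It is a hedged illustration: each entry is assumed to be kept independently with probability $\hat{p}_{ijk} = \min(1, m\,p_{ijk})$ and rescaled by $1/\hat{p}_{ijk}$, which is consistent with the observation that the error is zero whenever $\hat{p}_{ijk} \ge 1$. The helper name `sample_and_rescale` and the distribution used in the example are illustrative choices only.

```python
import numpy as np

def sample_and_rescale(T, p, m, rng):
    """Keep entry (i, j, k) with probability p_hat = min(1, m * p[i, j, k]) and
    rescale kept entries by 1 / p_hat, so that E[T_hat] = T entrywise."""
    p_hat = np.minimum(1.0, m * p)
    keep = rng.random(T.shape) < p_hat      # independent Bernoulli(p_hat) draws
    T_hat = np.zeros_like(T)
    T_hat[keep] = T[keep] / p_hat[keep]     # p_hat > 0 wherever keep is True
    return T_hat, p_hat

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 6, 6))
p = T ** 2 / np.sum(T ** 2)                 # illustrative distribution only
T_hat, p_hat = sample_and_rescale(T, p, m=50, rng=rng)
exact = p_hat >= 1.0                        # entries that are reproduced exactly
```

Entries with `p_hat >= 1` are always kept and left unscaled, so they contribute nothing to the error tensor.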


We follow the same strategy of ( Nguven et ah . 2010l i to bound 


T-T 


, by dividing the indices 


T-T 


{i,j, k) into various sets and bounding the error 
tensor with only entries such that Pijk > ^ anc 
of T satisfying pijk G [ 21 ^! W^n)' Similarly we dehne T^^^ corresponding to sampled and rescaled 
entries of T^^^. Also let s = [log(n^/^/In^ n)]. Hence using triangle inequality we get, 


over each set separately. Let denote 
similarly let T^^^ be the tensor with only entries 


< 


■^[1] _ 7"[i] ll-^W _ 7"W _|_ I_ 7"W 

1=2 «=s-|-l 


T-T 

Now we will bound each of the above three terms in the summation. 


1 = 1 case: 

Let n = - rt^l. Then 


^.2 / 'Tijk n\\X\\l C 2 2n\\X\\l 


m-p-jk m^Pijk 

where Ci follows from (IT^ and C 2 follows from pijk > Hence \'Hijk\ < This implies 




2„||X||2 


m 
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Now applying the tensor concentration theorem lA.31 for A = ^ and q < 5n/8 gives ns, 

V m 


1< |'log(n^/^/In^ n)] case: 

Again let Ti = Then, 


K 


< 


' ijk 


n\\X\ 


- “VI,, 


S „2 


< 


^ Pijk 


2^n\\X 

m 


l|2 

II 3 


Further E 
we get. 




r 

r / 

^ ^.0 \^1 

maxjk n 

-- .,9 

Li '^Ijk) 


maxjk ( 

l^i ^ijk) 


Hence using the above two equations 


E 


max 

jk 


E« 


2 

ijk 


< 


2'n||A||2 

m 


g/2 


E 


max 

jk 




ijk 


Now using Lemma 17 from (jNguven et ah . 2O10l i we know that E [max^^ < 2(5n2 ^ + 

61n(n) + 2qy. Hence from the above two equations and using theorem IA.3I gives us, 


E 


_ 7"W 




log2 ( ^ 


+ (31n(n) + q) * 2'+^ + VA2^n 


n\\X\\l 


m 


1> |'log(n^/^/ln^ n)] case: For this case note that pijk < ■ Hence from (fl^ 7^^^ < 

nil Alio Since the elements of T are small in this case the error is also small. Hence, 

^ < ^Jn\\X\\lnyHnyn)/m. 

V ijk 

Applying Markov’s inequality with q = 61n(n), combining the above three bounds we get 

In^(n) 


r-r 


< c- 


m 


-^Uh 


with probability >1 —V- 


□ 


C Proofs of Section 4


In this section we will present the proof of Theorem 4.1. First we will give certain properties of the
sampling distribution. Recall

$$p_{ijk} = \frac{\|(U^*)^i\|^{3/2}\|(U^*)^j\|^{3/2} + \|(U^*)^j\|^{3/2}\|(U^*)^k\|^{3/2} + \|(U^*)^k\|^{3/2}\|(U^*)^i\|^{3/2}}{3n\|U^*\|_{3/2}^{3}},$$

where $\|U^*\|_{3/2}^{3/2} = \sum_{i=1}^{n} \|(U^*)^i\|^{3/2}$. Also we will use $\delta_{ijk}$ to denote the indicator random variable
throughout the proofs.
Lemma C.1.


Tijk 
Pijk 


Pijk 

iu:,)iu;,)iUk,) 

Pijk 


— 3n||f7 113 / 2 - 


<n\\U*\\l/,. 


(16) 

(17) 

(18) 


Proof. Recall 




n"* 

kl — ^max-i 


l=l 


X 'ZWi) ,. 

\i=i \i 




* \k\ 


1=1 


Also by AM-GM inequality we get 

||(J7*)i||3/2||([/*)i||3/2 ^ ||(f/*y||3/2||([/*^fc||3/2 ^ || || 3/2 || 3/2 ^ || (t/* )* || || (17* )^'|| || (t/* )^ | 


Pijk — 






Hence the first inequality follows from the above two equations. 


For proving the second inequality we use the fact that $\|(U^*)^i\| \le 1$. Hence




— 3n||17 II 3/2 


®- 


p^,k ■■ "^'~\m*y\\2\\iu*) , 

The proof of third inequality follows from 

||(t/*)*||||(t/*)i||||(t/*)" 


<3n||H*||^/2- 


Pijk ^ 


n\\U* 


3/2 


□ 


To show that Algorithm 1 recovers the underlying factors we show that each iteration
decreases the distance to the true factors. For this we define the following notion of distance. Let
$U_l$ and $\sigma_l$ be the iterates at the end of iteration $t$. Then

$$d_\infty([U, \Sigma], [U^*, \Sigma^*]) = \max_{l}\left(\|d_l\| + \Delta_l\right), \quad (19)$$

where $\|d_l\| = \|U_l - U_l^*\|$ and $\Delta_l = \frac{|\sigma_l - \sigma_l^*|}{\sigma_l^*}$.
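
The distance measure in (19) can be computed as in the following sketch. It assumes that $\Delta_l$ is the relative error $|\sigma_l - \sigma_l^*|/\sigma_l^*$ and that the columns of the iterate are already matched to the true factors; the function name `d_inf` is hypothetical.

```python
import numpy as np

def d_inf(U, sigma, U_star, sigma_star):
    """max over l of ||U_l - U*_l|| + |sigma_l - sigma*_l| / sigma*_l."""
    col_err = np.linalg.norm(U - U_star, axis=0)        # ||d_l|| for each column l
    rel_err = np.abs(sigma - sigma_star) / sigma_star   # Delta_l (assumed relative error)
    return float(np.max(col_err + rel_err))

U_star = np.eye(3)[:, :2]                 # two orthonormal true factors
sigma_star = np.array([2.0, 1.0])
U = U_star.copy()                         # factors recovered exactly
sigma = np.array([2.2, 1.0])              # first weight off by 10 percent
dist = d_inf(U, sigma, U_star, sigma_star)
```

In this example only the first weight is off, by a relative error of 0.1, so the distance equals 0.1 up to floating-point rounding.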

Now we will show that the distance to iterates at the end of t+1 iteration decreases geometrically. 
Theorem C.2. Let doo([t7, S], [17*, S*]) < andU satisfies (l23|) . then, 

doo([t7(^+'\s(‘+i)],[C7*,S*]) < idoo([ll,S],[[7*,S*]), 

with probability greater than 1- for m > 0{{Yf,i\\U''\\"i)‘^nr^Kfi\o^{n)). Further [7^+1) 

satisfies (f23]l . 
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Proof. Recall that T = X][=i ® ® = 8'i'g™™ueR" W^nt^^+q O' ~ ~ 

Hence, 

-,+i EJk^^qkWiqkU;^u*,^u,,Ukq ^Ejk^vkWiMOu:iU;,u*,^-aiUuU,iU^^^^ 

( 20 ) 


Now we will show that the distance between the update and $U^*$ decreases with each iteration
by expressing the update in terms of the tensor power method update and an error term, similar to the proof
of (Jain & Oh, 2014). For the rest of the proof we will use the same notation as in (Jain & Oh,
2014).

UqfU* - B-\a;{U;,UqfB - alc)u; 

Ij^q l^q 

( 21 ) 

where B,C,F^^\G^^^ are diagonal matrices with, 

= J2 ^rjky^ijkU^qUlq, Cu = jqU*qUkqU*k^, 

jk jk 

Fu =Y.^^F^^FUjqU;iUkqmi, and =Y,^^jk'^^jkUjqUjlUkqUkl. 
jk jk 

Now dehne the error 


err" = u*((t/*, t/,)^ - l)t/* - B-\a;{U;,U,fB - a*^G)U; 
err} = {a}{U},UifUf - ai{U„UifUi) 

err} = B-\a}{{U,, U}fB - F^^^)Ut - aii{U„ UifB - G^^'^)Ui) (22) 


The goal is to bound each of these error terms in terms of the distances $\|d_l\|$ and $\Delta_l$, so as to
express $d_\infty([U^{t+1}, \Sigma^{t+1}], [U^*, \Sigma^*])$ in terms of $d_\infty([U, \Sigma], [U^*, \Sigma^*])$. Now we will bound the error

= err^ + X);^g(err/ + err}). By Lemma fCAl and [C..SI we get. 


errO < u*(l - (t/*, t/,)^ + - {U;,U,)^) < u*||d, 

Using Lemma B.10 and B.11 of Jain & Oh (2014) we get,

||err/|| < ^cr*i{\\di\\ + ||dq||)(||di|l + 


+ 27). 


and 


at{{Ug,Ut)^B - G^OUf - ai{{Ug,Ui)^B - F^^Ui = {{Uq,Uf){Uq,di)B - DW)Uf 

+ i{U„ umuq, di)B - + (([/„ di)^B - - i{U„ Ui)^B - F^^di 

< 8 lO\\di\\- 
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Note that are diagonal matrices with = Yljk^ijk^ijkUjgUkqdi{j)U^i, = Yljk^ijk^ijkUjqUkqdi{k)U*i 
and = Yljk^ijk'^ijkUjqUkqdi{j)di{k). Cl follows from Lemma IC.51 Lemma IC.Gi Hence 

llerr^^ll < 87 ( 7 j*(||d/|| + A;). Combining the bounds on all the error terms and setting 7 = we 

get 




lie - a*qU;\\ < r^WdqW + i/i6a*^M[u,nAu*,j:*]). 


Further \a\+^ - u*| < luj+^e' " < ^doo([C/, S], [C/*, S*]) and <|ie^ " ^^^11 < 


qW'-q 


-dooi[U, S], [U* , S*]). Hence combining these two equations we get, 


1 


|ieil + A*+i<-do,([H,S],[[/*,S*]). 


g II ' q - 2 

Now we will prove the second part of the theorem 


*9 


^ l^iil j-j^ I \ ^ 


— 9 


Bj 


cr, 




Gu 


+ Vur- 



< 


l^q 
1 + 7 


|H 


l^q 


Bi, 


m 


IKOil + ll-iill) + + Ai)(7 + II4.II) 

\ i¥=<} 

< ii([/*niu*(i +1/100), 

since 7 < 3 ^ 5 ^- Using the bound on Icr*"''^ — a*\ from above the result follows. 

Proof of Theorem\4-l\ The proof now follows from Theorem lC.2[ After log(4-y/r||T||ir/e)jteratioi^ 


□ 


the error \\Uq — Uf\\ < 


9 II - a^ItWf ^ 9 ! - 4VF|miF' 


20141 1 it follows that ||T — T|| < ||T — T\\f + e. 


Hence from Lemma 2.4 of (jJain &: OhI . 

□ 


C.l Supporting lemmas 

Lemma C.3. For $\Omega$ generated according to (5) and $U$ satisfying (23), there exists a constant $C$
such that the following holds:


E ^^,kW^qkUqqU;^Ukqmq - {Uq, U, 
jk 


*q? 


< 1 , 


for any fixed q, with probability greater than 1 — for m > ^n\og{n)\\U*\\'^i^. 

^jqUkqUkq- v... ^ >■ yzjK^ yj;* 


Proof. Let Xjk = dijkWijkUjqU*MkqUr. From m we get |X,fc| < < 


Also, 

m ’ 


E 


jk 


jk 


3n\\U*"^ 


3/2 


m 


T.vj,v, 

jk 


2 

kq 


M\u%^ 


m 


Hence by applying the Bernstein’s inequality the result follows. 


□ 
Lemma C.4. For $\Omega$ generated according to (5) and $U$ satisfying (23), there exists a constant $C$
such that the following holds for any fixed $b \in \mathbb{R}^n$:


- u:^{u„u;){u„b) 


jk 


< 


with probability greater than 1 — for m > |^nlog(n)||C/*|| 3 ^ 2 - 

, , , 12n||r/||2 J|fe|| 

Proof. Let Xjk = dijkWijkUf UjqU* Ukqbk- From ([HD and (|23D we get \Xjk\ < --. Also, 


E 


jk 


iq'^ JH'- jq" 

<Y.mjk{U*qU,qU*qUkqU*kqbkf < 
jk 


12 n\\U\\l^ 


m 


jk 


12n\\U\\l,\\bf 


m 


Hence by applying the Bernstein’s inequality the result follows. 


□ 

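Lemma C.5. Let $\Omega$ be generated according to (5), $U$ satisfying (23), and fixed unit vectors $a$, $b$
and $c$ in $\mathbb{R}^n$ such that $|a_i|, |b_i|, |c_i| \le 2\|(U^*)^i\|, \forall i$. Let $B$ and $R$ be diagonal matrices with
$B_{ii} = \sum_{jk} \delta_{ijk} w_{ijk} U_{jq}^2 U_{kq}^2$ and $R_{ii} = \sum_{jk} \delta_{ijk} w_{ijk} U_{jq} U_{kq} a_j b_k$. Then, there exists a constant $C$ such that
the following holds:

$$\|(\langle U_q, a\rangle\langle U_q, b\rangle B - R)c\| \le \gamma\sqrt{1 - \langle U_q, a\rangle^2\langle U_q, b\rangle^2},$$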
for any fixed q, with probability greater than 1 — for m > ^nlog{n)\\U*\\^^ 2 - 
Proof. Let X^j}^ — dij}fV\^ij}^CiUjqU}^q{UjqU}^q(fJq.fa'){Uq.,h') ajb]f)ei. Note that — 0. 


Cl 8n * ||C/*||3/2 

\Xijk II < - Y.^UjqUkq{Uq, a) {Uq,b) - U.bk) 


8n*||C/||3,2 /- 

„) 2 ([, 6 ) 2 , 

mV 


Cl follows from (fTHp . 
Also, 




Y, WiqkcjUjqUiq{UqqUkq{Uq, «) {Uq, b) 


ttjbk) 


Cl 48n\\U*\\l,^ 

< - ^{l-{Uq,af{Uq,bf). 


m 


Cl follows from (|17p . Hence the lemma follows from applying the matrix Bernstein inequality. □ 

Lemma C.6. Let $\Omega$ be generated according to (5), $U$ satisfying (23), and fixed unit vectors $a$, $b$
and $c$ in $\mathbb{R}^n$ such that $|a_i|, |b_i|, |c_i| \le 2\|(U^*)^i\|, \forall i$. Let $B$ and $R$ be diagonal matrices with
$B_{ii} = \sum_{jk} \delta_{ijk} w_{ijk} U_{jq}^2 U_{kq}^2$ and $R_{ii} = \sum_{jk} \delta_{ijk} w_{ijk} U_{jq} U_{kq} a_j b_k$. Then, there exists a constant $C$ such that
the following holds:

$$\|(\langle U_q, a\rangle\langle U_q, b\rangle B - R)c\| \le \gamma\|b\|,$$

for any fixed q, with probability greater than 1 — for m > ^n\og{n)\\Lf*\\‘^i 2 . 
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Pvoof. Let ^ijk — ^ijk'^^ijkC^iUjqUfii;qQjjbf^€i. TllCIl IE — Ci{Uq-, (X) {Uq^, b'). 

^ 12n||C/||^/,||5|| 

|2C^Ji'/c| ^ 


Further 




ijk 


kCp]^Ul^api 


< 


l2n\\Uf 


3/2 


m 


2 


Hence $\|(\langle U_q, a\rangle\langle U_q, b\rangle I - R)c\| \le \gamma\|b\|$ from the matrix Bernstein inequality.

From Lemma C.3, $|B_{ii}| \le 1 + \gamma$. Hence applying a union bound over all $i$ we get $\|B - I\| \le \gamma$.
Hence the lemma follows.


□ 


C.2 Initialization 


From Theorem 5.1 of (Anandkumar et al., 2014a) we know that the Robust Tensor Power Method (RTPM)
gives a good approximation of the factors for small error.

Lemma C.7. Let ||Ro(T) —T\\ < 6, then clog(r) iterations of RTPM on Rn{T) achieves: 

\\Ui — UfW < curd, and\ai — erf | < aiKrd, 

with probability greater than 1 — l/n^, for all I G [r]. 

We further threshold the entries of $U$ such that $|U_{il}| \le 2\|(U^*)^i\|$. Note that we can estimate these
quantities from the samples. Hence this guarantees that the initialization satisfies

$$|U_{il}| \le 2\|(U^*)^i\|. \quad (23)$$
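
The thresholding step can be sketched as below. This is an illustration under assumptions: the row norms $\|(U^*)^i\|$ are taken as given (the text states they can be estimated from the samples, but the estimator is not reproduced here), and entries are clipped symmetrically to the bound in (23). The name `threshold_rows` is hypothetical.

```python
import numpy as np

def threshold_rows(U, row_norms):
    """Clip each entry U[i, l] to [-2 * row_norms[i], 2 * row_norms[i]]."""
    bound = 2.0 * np.asarray(row_norms, dtype=float)[:, None]
    return np.clip(U, -bound, bound)

rng = np.random.default_rng(0)
U_star = np.linalg.qr(rng.standard_normal((8, 3)))[0]   # orthonormal true factors
row_norms = np.linalg.norm(U_star, axis=1)              # stand-in for estimated row norms
U_init = U_star + 0.5 * rng.standard_normal((8, 3))     # a rough initialization
U_thr = threshold_rows(U_init, row_norms)
```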


D Proofs of Section 5


In this section we will present the proof of Theorem 5.1. The proof follows the same way as
the noiseless version, with key modifications which we will discuss now. We will first present the
key properties of the sampling distribution which we will use in the rest of the proof. Recall
Pijk = 0.5 +'^k 7 ^ 0.5^, where Ui = and Z = is the 


IITIIf 


vT 


normalizing constant. Recall Tijk = 


D.1 Initialization

First we will show that sampling (11) followed by RTPM generates a good approximation to the
underlying factors.

Lemma D.l. Given T = ® ® ^ where U* is orthonormal matrix and £ 

satisfies cni), the output U of step 4 of algorithm satisfies the following: 

\\Ui-U*\\<r ^— and |Ljj| < 2z^j,Vi £ [r], 
lOOrK 

with probability > 1 — ^ for m > log^(n)K'^). 
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Proof. First from Theorem 1 of ( Nguven et ah . 2 OI 0 II we get that ||i?n(T) — T\\ < e||T||_F, for 
m > log^(n)). Note that by triangle inequality, and equation (fT0]l we get, 


imiF< 


\j i=i 


Hence, 


\\Rn{T) -Y^atUt ® Uf ® Ut\\ < ||T|| < e{2a*^,,V^) + 


1=1 


lOOr 


for e < fi ; which is true for m > 0{n^'^r^K'^log^{n)). Hence the factors U computed using 
RTPM on Rq{T) satisfies 

1 


\\Ui-Ut\\ < 


lOO^r 


, yi e [r]. 


from Theorem 5.1 of ( Anandkumar et ah . 2r)14al ). for m > 0{n^'^r^K^log^{n)). 

Further we threshold each entry of $U$ such that $U$ satisfies (24). Note that the proof that the
thresholding step does not increase the distance to the optimal factors by more than a constant
factor follows the same way as the proof of Lemma 3.2 in (Bhojanapalli et al., 2015), and we will
not discuss it here.

□ 


D.2 WALS


Now, before we present the proof of Theorem 5.1, we will present some bounds on the error caused
by the noise in each stage of the WALS algorithm.

One key modification compared to the noiseless case is that we need the iterates to satisfy the following
bound in each iteration.


\U^J\ < 2 




(24) 


Now we will discuss some key properties of the sampling.

Lemma D.2. For $U$ satisfying (24) the following holds for distribution (11).


Pijk 

{U^,mg)iUk,) 

Pijk 


< 96nZ. 


< 16nZ. 


(25) 

(26) 


$Z$ is the normalizing constant in (11).

Proof. The proof follows the same way as the proof of Lemma C.1. □

Since most of the supporting lemmas in Section C.1 depend on the relations above, they all follow
immediately. Next we will characterize the error caused by noise in each iteration of WALS.
Lemma D.3. For $\mathcal{E}$ satisfying (10) and iterate $U$ satisfying (24) the following holds,

iiEE ^ij k ij k ^ij k qUkq^i EE '^ijk^ijk^jqUkq^i^ 

i jk i jk 

with probability > 1 — for m> 0{^ log^(n)). 

Proof. Let Xijf^ ^ijk'^^ijk^ijkb^jqb^kq^i' TllGIl, 

a CV^Sjjk g CV^\\£\\f ^ CnZ\\£\\F 
“ rriytp^ - mnytp~jk ~ rn 

Cl follows from (j25p . C 2 follows from (jlOp . Cs follows from pijk > Now we will bound the 

variance. 


i jk 

h £ij^ * CnZ ^ \\£\\\ * CnZ 

I jk 

Cl follows from (I25p . Hence by matrix Bernstein’s inequality, the Lemma follows. □ 

Note that 


^ ^ ^ ^ '^ijk^ijkUjqUkq^i 

= 

Y,U^{£i,w)Uqe^ 

i jk 


i 


Hence the above lemma implies || Yhi Yljk ^ijky^ijk£ijkUjqUkqe-i\\ < ||f || + e||£i||_F, w.h.p. 




Lemma D.4. Let hoo([b^) E], [17*, E*]) < U satisfies (l2^ and £ satisfies (fTOl) then, 

E(*+i)], [t/*, S*]) < ^doo([t7, E], [7/*, E*]) + +p\\^\\F ^ 

^ '^min 

with probability greater than 1 — for m > log^(n)). Further satisfies (0^ . 

Proof. The proof follows the same lines as the proof of Theorem C.2. Hence we only discuss the
modifications to the noiseless case caused by the additional noise term.




= 


Hjk^iqkmqkUlUl^ 


+ 2^ 

l^q 


Ejk^ijkW,jWf,ui^ 

^ijk^ijk^ijkUjqU}^q 

* T.ikiviyVijWj,vl, ' 


(27) 
From Lemma D.3 and C.3 we get the following bound on the norm of the noise term in each
iteration,

II E S 2(l|f II + e||f 11^), 




Hence 


- <[/*|| < -^\\dg\\ + l/16a*^M[U, S], [U*, S*]) + 2(||£:|| + e\\£\\F). 




Purther|u‘+i-c7;| < -a;U*\ < $doo([C/, S], [C/*, E*])+2(||f ||+e||^||ir) and a;||C/*+i- 

^q\\ ^ ^dooilU,!^], [C/*,S*]) +4(||£’|| + e||£’||i?). Hence combining these two equations we get, 

ii4+i+< idoo([c/, E], [[/*, s*])+ 


Now we will prove the second part of the theorem. 




\Cii\ 

\Bii\ 




Gu 


+ Var- 



< 


l^g 
1 + 7 


IB 


l^q 


Bi, 


m 


iKoii + E'^i'h + ii‘''ii) + E<'i*(i+ ai)( 7+ii4.il) 


l^q 


l^q 


< ii([/*ni<(i +1/100), 


since 7 < Using the bound on Icrg"*"^ — cr*\ from above the result follows. 

Further, to show that the iterates satisfy condition (24), consider the following.


*9 


<||{t/*)‘K(l +1/100) + ^ dijk^ijk^ijkUjlUj^l + 2(Tj^g^,^( 

jk 




IITIIf 


To bound $\sum_{jk} \delta_{ijk} w_{ijk} \mathcal{E}_{ijk} U_{jl} U_{kl}$, note that


m,k£zjkUjiUki\ < 


rn^"^m 


y/Pijk 


m rn 


0.75 


and 


^w„k£y4uh <^Ey ^ 


2 < «m)^ 


m rn 


jk jk 

Hence with high probability by Bernstein’s inequality we can say that 


^ dijkWjjk^ijkUjlUkl + ^ £ijkUjlUkl + 
jk jk 


^min ('1 _j_ 1 7 ^min 

^ ^ lOO'^ ./n 


n 


Hence the result follows. 


□ 
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Note that for m > log^(n)), the error in the above lemma decreases from to 

Now we have all the ingredients to present the proof of Theorem 5.1.


Proof of Theorem 5.1. The proof now follows from Lemma D.4. After $\log(4\sqrt{r}\|T\|_F/\gamma)$ iterations,


\TT — TT*\\ < _ 2 _ 

\^q ^q\\ 2^ 4V7 \\T\\f 


+ ll^ll^ + and \aq - cr*q\ < 


Hence, 


4^T||r|h 


+ \\e\\^^ + e\\8\\F- 


Y. mUi II < E 1^' - <1 + E 11^^ 


\\Ui (g)'" -a*iUf II < \\{Ui - Ut) ®Ui® UiW + \\Ui ® {Ui - Uf) ® Uf\\ + \\Ut ® Uf ® {Ui - 

Hence combining the above two relations we get, 






7 


4V^lirih 




0.25 


< 7 + 48r«;||T|| + e||T||_F 




n 


0.25 • 


□ 